Accumulation of gene copy number variations during the early phase of free‐spawning abalone speciation

Abstract The genetic basis of speciation in free‐spawning marine invertebrates is poorly understood. Although gene copy number variations (GCNVs) and nucleotide variations possibly trigger the speciation of these organisms, empirical evidence for such a hypothesis is limited. In this study, we searched for genomic signatures of GCNVs that may contribute to the speciation of Western Pacific abalone species. Whole‐genome sequencing data suggested the existence of significant amounts of GCNVs in closely related abalones, Haliotis discus and H. madaka, in the early phase of speciation. In addition, the degree of interspecies genetic differentiation in the genes where GCNVs were estimated was higher than that in other genes, suggesting that nucleotide divergence also accumulates in the genes with GCNVs. GCNVs in some genes were also detected in other related abalone species, suggesting that these GCNVs are derived from both ancestral and de novo mutations. Our findings suggest that GCNVs have been accumulated in the early phase of free‐spawning abalone speciation.


| INTRODUC TI ON
A major goal in marine biology is understanding the genetic basis of speciation in free-spawning marine invertebrates (Pogson, 2016).
Genomic analyses based on single nucleotide polymorphism (SNP) loci have identified candidates for genetic variation that contributes to marine speciation (Hirase et al., 2021;Momigliano et al., 2017).
However, a growing body of evidence suggests that gene copy number variations (GCNVs) also play an important role in the ecological divergence of various organisms (Castagnone-Sereno et al., 2019;Hirase et al., 2014;Ishikawa et al., 2019;Pezer et al., 2015) including marine species (Dorant et al., 2020). Therefore, GCNVs, as well as nucleotide variations, may trigger speciation in the ocean; however, empirical evidence for these hypotheses is limited.
In the current study, we focus on GCNVs in the Western Pacific abalones, Haliotis discus, H. madaka, and H. gigantea (Ino, 1952).
These three species are estimated to have diverged recently from North America and are genetically close to the North American abalones (Geiger & Groves, 1999;Hirase et al., 2021). Population genomic analyses of these species revealed that although the three species are genetically distinct, there is evidence of historical and ongoing gene flow among these species. The most closely related pair, H. discus and H. madaka (genome-wide F ST = 0.007), appears to occupy the early stages of the speciation continuum after the initial divergence of H. gigantea (Hirase et al., 2021). In the current study, we searched for GCNVs between H. discus and H. madaka based on whole-genome sequencing (WGS) data. Our results showed the possibility of GCNVs accumulating during their speciation event. To further investigate whether the candidate GCNVs were derived from de novo or standing genetic variations, we examined GCNVs in H. gigantea and North American abalones that are expected to be genetically close to the Western Pacific abalones.
2 | MATERIAL S AND ME THODS 2.1 | Alignment of whole-genome sequencing data WGS data from the Western Pacific (eight H. discus, eight H. madaka, and six H. gigantea) and North American abalone species (three H. rufescens) were used. These were generated using Illumina HiSeq X Ten with the 150-bp paired-end protocol in our previous study (Hirase et al., 2021; (Nam et al., 2017) as described by Hirase et al. (2021). PCR duplicates were removed from constructed BAM files using SAMtools ver. 1.9 markdup (Li et al., 2009).

| Identification of gene copy number variations
We compared the number of mapped reads for each gene, which was predicted in a previous study (Hirase et al., 2021), between H. discus and H. madaka using BAM files (Table A1 in Appendix A), and identified candidate GCNVs by referring to Hirase et al. (2014). If the number of mapped reads was significantly larger in one species, the gene would have been duplicated or multiplied specifically in that species. By contrast, if the numbers were significantly smaller, the gene would have been deleted, or its copy number would have decreased. Briefly, the number of mapped reads that overlapped with predicted gene regions (i.e., any exonic or intronic region) was counted using the featureCounts function of Subread ver. 1.4.6 (Liao et al., 2014). We removed genes onto which no reads were mapped in at least one individual, because an insufficient number of mapped reads may result in the detection of false GCNVs. We then searched for candidate GCNVs by detecting genes that showed significant differences in normalized read numbers between H. discus and H. madaka using the edgeR software package (Robinson et al., 2010) with a false discovery rate (FDR) < 0.01. For normalization with edgeR, the total number of mapped reads across the genome of each individual was used. To confirm that the number of identified GCNVs was significantly larger than expected by chance, we calculated an empirical p-value based on a permutation test. In this test, we randomly reallocated eight H. discus and eight H. madaka individuals into two groups 10,000 times, performed the above analyses for each generated dataset, and obtained the null distribution of the numbers of GCNVs. Gene functions were annotated using BLASTP searches against the NCBI nonredundant (nr) protein database. We con- In this study, SNPs were called for H. discus and H. madaka, H. gigantea using SAMtools mpileup (−Q 30) and bcftools, and filtered using VCFtools (--minQ 20 --remove-indels --maf 0.05 --max-alleles 2 --minDP 6 --max-missing-count 1). This SNP information was used for the detection of three-or-more-different allelic pairs later (see next section).

| Detection of three-or-more-different allelic pairs
Three-or-more-different allelic pairs (≥3 allele sequences) within a gene could be evidence of the GCNVs because ≥3 allele sequences of a gene cannot occur in a diploid genome (Hirase et al., 2014). Therefore, we examined whether ≥3 allele sequences were observed in the candidate GCNVs between H. discus and H. madaka, which were detected by read-depth-based analyses. The ≥3 allele sequences of candidate GCNVs were also identified in H. gigantea.
In this analysis, we enumerated every pair of SNP positions, for each of the identified candidate GCNVs, that were located within 100 bp (within-read-length single nucleotide variation: SNP position pairs) using BAM files, the vcf file, and a custom Perl script (deposited in Dryad). Briefly, the number of different nucleotide pairs for each of the within-read-length SNP position pairs, which were supported by multiple (≥2) reads (by taking into account sequencing error) was counted. Then, we selected candidate GCNVs that were supported at least by one ≥3 allele sequence in at least three individuals from either species ( Figure A1 in Appendix A). The ≥3 allele sequences in candidate GCNVs were also identified in North American abalone species. SNPs of North American abalones were called separately from those of the Western Pacific abalones, using the same method mentioned above, and filtered using VCFtools (−-minDP 10 --remove-indels --max-missing-count 0).

| RE SULTS
Significant differences in the numbers of mapped reads between H. discus and H. madaka (FDR < 0.01) were observed for 627 genes (Figure 1a), and this number was significantly higher than expected by chance (p < .05). This suggests the accumulation of gene copy number differences between the two species. Among these genes, 328 genes were expected to have more copies in H. discus (called as HD-increased GCNVs), and 299 genes were expected to have more copies in H. madaka (called as HM-increased GCNVs). Next, we compared the sliding-window F ST within genes, where GCNVs were estimated, with those of all genes. The level of genetic differentiation in the genes where GCNVs were estimated was higher than that in all genes (Wilcoxon rank sum test; p < .05; Figure 1b).

| DISCUSS ION
Gene copy number variations can cause organisms to inhabit new ecological niches (Ishikawa et al., 2019). We obtained genomic evidence of significant amounts of GCNVs between H. discus and H. madaka, which have different ecological niches and have recently speciated (Hirase et al., 2021). This result suggests that the accumulation of gene copy number differences is present in the early speciation stages of freespawning abalones. In addition, the degree of genetic differentiation in the genes where GCNVs were estimated was higher than that in other genes, consistent with previous observations that many CNVs were found in genes for which SNP-based analyses detected signatures of positive selection (Feulner et al., 2013;Gokcumen et al., 2011;Hirase et al., 2014). This trend possibly suggests that nucleotide divergence also accumulates in the genes with GCNVs in abalones. Alternatively, there may be some biases in the F ST estimates in the genes with GCNVs because alignments of multiple copies to one reference genome can cause the increased SNP variations in these genes (Feulner et al., 2013).
This trend needs to be examined in more detail in the future.
Among the 627 candidate GCNVs detected, those in eight genes were confirmed by detecting ≥3 allele sequences in either H. discus or H. madaka. The method based on ≥3 allele sequences is not suitable for the detection of recently generated gene duplications, but instead robustly detects GCNVs of genetically distant species, because it does not depend on the efficiency of read mapping onto the reference genome, unlike depth-based analysis (Nijkamp et al., 2012). We found that six of the eight genes had ≥3 allele sequences in H. gigantea and/or the North American abalone species.
These results suggest that GCNVs between H. discus and H. madaka, which have recently diverged, are derived from both de novo (Zarrei et al., 2018) and standing genetic variations (Feulner et al., 2013).
Within one gene (STRG.16819), we detected ≥3 allele sequences in more genomic regions in all the North American abalones species than in H. discus and H. madaka, which is consistent with the expectation that nucleotide mutations between standing duplicated genes have accumulated in the ancestral North American abalone species.
Our findings may suggest future issues regarding the genome assembly of abalones. Although the ≥3 allele sequences in one suggested that Hsp70 genes evolved into seven copies in thermotolerant species (Evgen'ev et al., 2004). Hsp genes comprise several families based on their molecular weights (Kampinga et al., 2009).
Among these, sHsp genes are the smallest members of the Hsp superfamily. In H. discus, two sHsp genes, sHsp26 and sHsp20, have been reported. The expression of these genes occurs in multiple tissues and is strongly affected by environmental stress (Park et al., 2008;Wan et al., 2012). In particular, sHsp20 mRNA expression is rapidly elevated upon exposure to thermal, oxidative, and multiple toxic metal stresses (Wan et al., 2012), suggesting that this gene contributes to the adaptation of abalones to various environments. Since there was only one sHsp20 in the genomes of the four North American abalones, the sHsp20 gene duplication may have occurred after the speciation of Western Pacific abalone from the North American abalones F I G U R E 2 Distributions of normalized read depth in two genes possibly encoding small heat shock protein 20 (sHSP), STRG.15773 and STRG.24666. STRG.15773 is an HD-increased gene copy number variation gene that was supported by the difference in the number of mapped reads between eight Haliotis discus and eight H. madaka individuals. Each boxplot represents the normalized read depth and average normalized read depth, respectively, of the mapped reads per 200-bp nonoverlapping window for eight H. discus and eight H. madaka individuals. For normalization, the number of mapped read in each region was divided by the total number of mapped reads across the genome and multiplied by 10 million. Gene models are shown at the bottom of each panel. (Geiger & Groves, 1999;Hirase et al., 2021), and that the copy number of STRG15773 increased specifically in H. discus. Previous gene expression analyses have indicated that the sHsp20 gene is likely involved in abalone defenses against extreme environmental stress (Wan et al., 2012). Compared with other Western Pacific abalones, H. discus inhabits shallow depth zones where environmental fluctuations are more intense because of the influx of freshwater and the effects of varying water temperatures (Sinex, 1994). Therefore, it is possible that sHsp20 is involved in the ecological adaptation of H. discus.
Among the eight GCNVs in which ≥3 allele sequences were detected, one HM-increased GCNV was annotated as a mucin gene (Figure 3). Aquatic invertebrates protect the surfaces of their bodies, gills, and intestines with a mucus layer, which is composed of mucin glycoproteins (Bakshani et al., 2018). The mucus layer serves as an antimicrobial barrier and physical protective layer and has several physiological functions (Stabili, 2019). Copy number variations of mucin genes may affect the adaptive divergence between H. discus and H. madaka. The expression of this gene has been reported to respond to thermal stress in hybrids of H. discus hannai and H. gigantea (Xiao et al., 2021). Given that mucin genes belong to multigenic families (Desseyn et al., 1997), this finding is consistent with the idea that GCNVs of multigenic family genes are more likely to occur than those of single-copy genes (Hirase et al., 2014;Nguyen et al., 2006).
This study provides the first empirical data showing GCNVs in the early phase of marine invertebrate speciation and suggests that GCNVs accumulate in the early phase of marine invertebrate speciation. In addition, some GCNVs were detected in ancestral species, suggesting that GCNVs are derived from both ancestral and de novo mutations.

ACK N OWLED G M ENTS
The authors are grateful to members of the Kikuchi laboratory for their helpful comments on this research.

FU N D I N G I N FO R M ATI O N
This work was supported by the Japan Society for the Promotion of Science (KAKENHI 21580240, 17K19280, 22H00377).

CO N FLI C T O F I NTE R E S T S TATE M E NT
We declare no competing interests.

DATA AVA I L A B I L I T Y S TAT E M E N T
The WGS data used in this study are provided in Table A1 in Appendix A. The perl script for detecting ≥3 allele sequences is deposited in the Dryad Digital Repository: https://datad ryad.org/ stash/ share/ WGpUY JzAGl XloIl Shuop LT65W heRf1 ra1ve ScjvF7vg.